Abstract. Brad Efron's paper has inspired a return to the ideas behind
Bayes, frequency and empirical Bayes. The latter preferably would
not be limited to exchangeable models for the data and hyperparam- 
eters. Parallels are revealed between microarray analyses and profiling 
of hospitals, with advances suggesting more decision modeling for gene 
identification also. Then good multilevel and empirical Bayes models 
for random effects should be sought when regression toward the mean 
is anticipated. 

Key words and phrases: Bayes, frequency, interval estimation, ex- 
changeable, general model, random effects. 



1. FREQUENCY, BAYES, EMPIRICAL BAYES 
AND A GENERAL MODEL 

Brad Efron's two-groups approach and the empirical
null ("null" refers to a distribution, not to a
hypothesis) extension of his local fdr address testing
many hypotheses simultaneously, with modeling
enabled by the repeated presence of many similar
problems. He assumes two-level models for random
effects, developing theory by drawing on and
combining ideas from frequency, Bayesian and em- 
pirical Bayesian perspectives. The last half-century 
in statistics has seen exciting developments from 
many perspectives for simultaneous estimation of 
random effects, but there has been little explicit par- 
allel work on the complementary problem of hypoth- 
esis testing. That changes in Brad's paper, especially 
for testing many hypotheses when exchangeability 
restrictions are plausible. 

"Empirical Bayes" is in the paper's title, said in 
Section 3 to be a "bipolar" methodology that draws
on frequency and Bayes, but otherwise with a meaning
left for us to infer from the paper's example
datasets. The examples all involve two-level models
with inferences about many unknown parameters,
that is, about the unknown random effects.
Blending frequency and Bayesian thinking in statis- 
tics will be appreciated especially by statisticians 
who engage both in theoretical and in applied re- 
search, and we know that many of statistics' best 
and time-honored procedures perform well simulta- 
neously from the frequency and the Bayesian per- 
spectives. Classifying statisticians as either Bayesian 
or frequentist ignores the fact that these terms have 
varying meanings to different statisticians, and it en- 
courages the view that statisticians must adopt just 
one of these perspectives exclusively, which many 
statisticians, myself included, do not do. 

The frequency perspective requires comparing pro- 
cedures on the basis of repeated sampling, but it can 
be neutral about how procedures are constructed. 
The Bayesian approach, after a model is completely 
specified, including the "prior" ("structural" or "mix- 
ing" might be better adjectives) distribution, must 
use the laws of probability to assess uncertainties 
about unknowns, given the observed data and the 
model. Valuably even from the frequency perspec- 
tive, Bayesian reasoning can be used to suggest how 
to construct inferences about population parameters 
and other unobservables, at least in ideal settings. 
That is illustrated in Efron's treatment of the fdr
and the Fdr here. Such modeling of likelihood functions
at more than one level, and of priors, becomes
less subjective when one has more data, especially 
with massive datasets like those in the paper. 

Scientists have been encountering ever more massive
datasets, especially since modern sensing technology
and computers have evolved to make collecting,
organizing, visualizing and analyzing such
databases possible. My early experience in the 1980s 
involving two-level models for NASA's satellite im- 
agery data made it clear to me that science had 
reached a new point where computers not only had 
enabled us to analyze large datasets, they even made 
it possible to collect very large datasets. The com- 
puter had become a horse that could collect and 
analyze data as we directed, and statisticians were 
its jockeys. While the massive datasets we see today 
can be overwhelming, Brad rightly recognizes that 
they can be welcomed as opportunities to build bet- 
ter models. That not only leads to more accurate 
inferences for the given data, but better models also 
advance knowledge and future scientific discoveries. 

Brad's use of "empirical Bayes" with the six datasets 
of the paper is restricted to datasets he considers 
to be exchangeable. That could signify his moving 
away from a liberalizing view of empirical Bayes that 
we once developed together. I doubt this, but the
analyses shown assume that the joint distribution of
the data and of the random effects is exchangeable.
Our papers together in the 1970s moved empirical 
Bayes away from that requirement, partly to provide 
a perspective from which acceptable shrinkage gen- 
eralizations of Stein's estimator might be developed. 
That was and is needed especially when (nearly) un- 
biased estimates of the different random effects have 
different variances, perhaps most often because of 
different sample sizes. 

In seeking a firmer basis for modeling and in- 
ference in empirical Bayes settings (Morris, 1983), 
I continued back then to use that term. However, 
Herb Robbins, who coined the term, made it clear 
to me then that the version he had pioneered, built 
around exchangeability, asymptotics and nonpara- 
metric mixing (prior) distributions, was how he wanted 
the term to be used. Also about then, D. V. Lindley 
averred that "There is no one less Bayesian than an 
empirical Bayesian," a comment that seemed mainly 
directed at Robbins' approach. Some other statisticians
then, and perhaps still today, thought of empirical
Bayes as restricted to plugging hyperparameter
estimates into Bayes rules. So the term "empirical
Bayes" meant different things to different statisticians,
and not always good things.

It also had become clear to me back then that 
dealing with many inferences simultaneously had 
to be guided by Bayesian reasoning. For example, 
Bayesian constructions show why interval estimates 
based on plug-in methods can be much too narrow, 
especially when the number ($N$, in the notation of
Brad's paper) of random effects being estimated is 
small or moderate. So I began to use the term empir- 
ical Bayes more sparingly to describe my own work. 
In building on the ideas behind my 1983 paper, and 
when trying to combine frequency ideas with Bayes 
in hierarchical models, I sometimes have referred to 
a "general model for statistics" for the desired fre- 
quency/Bayes unification. 

The general model includes distributions for data 
given parameters of interest, and for the hyperpa- 
rameters that govern the distribution of those pa- 
rameters, conceptually (but not always) specified for 
at least two hierarchical levels. From the frequency 
perspective in this general model, all possible distri- 
butions would be considered for the hyperparame- 
ters, those being mixtures of atomic (Dirac) distri- 
butions. From the subjective Bayesian perspective, 
just one distribution (a prior at the top level of the 
hierarchy) would be allowed in a particular inferen- 
tial problem. (This framework extends to nonpara- 
metric models by letting the parameters and/or hy- 
perparameters be infinite dimensional.) 

This general model puts frequency and Bayesian 
models at the endpoints of a continuum, with the 
middle span open for flexibly specifying restrictions 
on distributions that could accommodate empiri- 
cal Bayes and other models. Decision theory ex- 
tends to this general model so that frequency (re- 
sampling) evaluations would be done conditionally 
for the range of hyperparameters. Such resampling 
was carried out when evaluating the coverage prob- 
abilities of parametric empirical Bayes interval es- 
timates in Morris (1983) and in much other work 
since then. In a University of Texas dissertation, Joe 
Hill showed how this general framework extended to 
ancillarity, information, and other fundamental sta- 
tistical ideas (Hill, 1986, 1990). 

Aside from their different interpretations, the fre- 
quency and Bayesian perspectives can be quite com- 
plementary. The frequency paradigm is normative, 
but not necessarily prescriptive. The fundamental
theorem of (frequency) decision theory, that is, the
complete class theorem, supports the Bayesian connection
by recognizing that the admissible procedures
nearly coincide with the class of extended Bayes
rules. Statistical procedures with good repeated sam- 
pling (frequency) properties often can be anticipated 
by thinking about Bayesian constructions. 

A reminder of how Bayesian procedures can have 
better frequency properties than those derived solely 
by frequency reasoning is illustrated by a graph with 
N = 15 in Christiansen and Morris (1997, Figure 1). 
Poissonly distributed summary data like those seen 
at heart transplant hospitals are fitted there via two- 
level models. The graph there shows the coverage 
rates in repeated sampling of nominal 95% inter- 
vals when the transplant success rates are simul- 
taneously estimated at the different hospitals. Six 
procedures are evaluated. Two follow Bayesian con- 
structions, one that uses the BUGS program and 
default prior, and the other being an accurate ap- 
proximation of a hierarchical Bayes procedure based 
on a hyperparameter prior akin to Stein's super- 
harmonic prior for Normal distributions. These two 
Bayesianly motivated interval procedures cover or 
nearly cover 95% of the time in repeated sampling 
simulations, as intended. The four frequency pro- 
cedures based on MLE, REML and on two GLM 
multilevel techniques, have coverages in the range 
of 60% to 90%, falling well below the claimed cover- 
age rate of 95%. Whether developed from Bayesian 
or frequency considerations, good frequency proce- 
dures must provide coverages in repeated sampling 
close to their claimed values, but the four non-Bayesian 
procedures do not meet that standard. 

2. fdr, Fdr AND EXCHANGEABILITY

Brad illustrates the use of Bayesian modeling and 
probabilistic reasoning with his six large datasets 
to produce approaches to hypothesis testing that 
would be valid if prior information were available. 
Then he shows how to estimate the needed prior, or 
mixing, distributions from repeated data. 

Probabilistic modeling leads directly to Efron's
local fdr, which in turn leads to the Benjamini-Hochberg
Fdr procedure. Starting with the simplest
"two-groups" model, with density $f_0$ under the null
hypothesis $H_0$ and $f_1$ under the alternative
hypothesis $H_1$, the paper moves through increasingly
elaborate probability models discovered in the process
of modeling and analyzing exchangeable data and
repeated problems. Benjamini and Hochberg's celebrated
false discovery rate statistic Fdr applies when
all the $H_0$ distributions have a single theoretically
determined density function $f_0$, and when the prior
probability $p_0$ of $H_0$ is high (at least 0.9). Then
$f_1$, the $H_1$ density, is available via estimating the
marginal density, $f(z) = p_0 f_0(z) + p_1 f_1(z)$, and
solving for $f_1(z)$. While $f_1$ is not actually needed in
exchangeable cases, it will be for a nonexchangeable
extension which I will review later. Thus, a direct
estimate of the posterior probability of $H_0$, given the
data, only requires $p_0$, $f_0$ and $f(z)$ in this simplest
case.

This approach is beguilingly simple, but its valid- 
ity depends crucially on a restrictive exchangeability 
assumption that can be missed. The marginal density
$f(z)$ will be the same for all the $Z_i$ observations
only if the same $f_1$ distribution holds under $H_1$ for
all $Z_i$, $i = 1, \dots, N$. This may hold for five of the six
datasets in the paper, but it does not for the school
data, as discussed later.

As formula (2.7) shows, the local fdr is the posterior
probability of $H_0$, that is,

$$\mathrm{fdr}(z) = P(H_0 \mid Z = z) = \frac{p_0 f_0(z)}{p_0 f_0(z) + p_1 f_1(z)}.$$
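
To make the formula concrete, here is a minimal Python sketch of the local fdr for one two-groups model. The inputs are assumptions chosen for illustration, not values taken from Efron's analyses: a theoretical N(0, 1) null, a prior null probability of 0.9, and an alternative density N(2.5, 1.5). The helper names `normal_pdf` and `local_fdr` are hypothetical.

```python
import math

def normal_pdf(x, mean=0.0, var=1.0):
    """Density of N(mean, var) evaluated at x."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def local_fdr(z, p0, f0, f1):
    """Posterior probability of the null given Z = z in the two-groups model."""
    null_part = p0 * f0(z)
    alt_part = (1.0 - p0) * f1(z)
    return null_part / (null_part + alt_part)

# Assumed inputs, for illustration only.
f0 = lambda z: normal_pdf(z)                      # theoretical null N(0, 1)
f1 = lambda z: normal_pdf(z, mean=2.5, var=1.5)   # assumed alternative density

for z in (0.0, 2.0, 3.5):
    print(f"z = {z:3.1f}   fdr(z) = {local_fdr(z, 0.9, f0, f1):.4f}")
```

Under these assumed inputs fdr(z) is close to 1 near z = 0 and falls to roughly 0.03 at z = 3.5, so only cases far in the right tail would be flagged.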

Starting with fdr(z) before introducing Fdr(z) seems
natural, but this particular history has developed
oppositely. Efron's local fdr is immediately interpretable
in probabilistic or Bayesian terms because
choosing between hypotheses $H_0$ and $H_1$ means
considering $P(H_0 \mid z)$, and also because fdr depends on
the likelihood ratio, and on the Neyman-Pearson
statistic.

As Brad writes, the Benjamini-Hochberg Fdr statistic
(2.3) is the integral of fdr(z). Starting with
fdr(Z) = $P(H_0 \mid Z)$ and assuming that one will choose
$H_1$ whenever $Z \le z$ leads to

$$E(\mathrm{fdr}(Z) \mid Z \le z) = P(H_0 \mid Z \le z) = \mathrm{Fdr}(z),$$

as shown in the paper, and this is

$$\mathrm{Fdr}(z) = \frac{p_0 F_0(z)}{p_0 F_0(z) + p_1 F_1(z)}.$$

Thus, Fdr(z) = $P(H_0 \mid Z \le z)$ is the fraction of times
that $H_0$ would be falsely rejected. The Benjamini-Hochberg
false discovery rate Fdr(z) is discovered
probabilistically as the average probability (the
preposterior probability in Bayesian terms) of accepting,
that is, discovering, $H_1$ falsely.
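
The averaging relation between the local fdr and the tail-area Fdr can be checked numerically. The sketch below does so for one assumed two-groups model (null N(0, 1), null probability 0.9, and an alternative N(-2.5, 1.5) placed on the left so that rejecting for small Z makes sense). These values and the function names are illustrative assumptions, and the integral is approximated by a simple trapezoid rule.

```python
import math

def norm_pdf(x, mean=0.0, var=1.0):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def norm_cdf(x, mean=0.0, var=1.0):
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0 * var)))

# Assumed values, for illustration only.
P0, ALT_MEAN, ALT_VAR = 0.9, -2.5, 1.5

def mixture_pdf(z):
    """Marginal density f(z) of the two-groups mixture."""
    return P0 * norm_pdf(z) + (1.0 - P0) * norm_pdf(z, ALT_MEAN, ALT_VAR)

def fdr(z):
    """Local false discovery rate P(H0 | Z = z)."""
    return P0 * norm_pdf(z) / mixture_pdf(z)

def Fdr(z):
    """Tail-area false discovery rate P(H0 | Z <= z)."""
    null_tail = P0 * norm_cdf(z)
    alt_tail = (1.0 - P0) * norm_cdf(z, ALT_MEAN, ALT_VAR)
    return null_tail / (null_tail + alt_tail)

def averaged_fdr(z, lower=-12.0, steps=20000):
    """E[fdr(Z) | Z <= z], by trapezoidal integration over (lower, z]."""
    h = (z - lower) / steps
    num = den = 0.0
    for j in range(steps + 1):
        t = lower + j * h
        w = 0.5 if j in (0, steps) else 1.0   # trapezoid end-point weights
        num += w * fdr(t) * mixture_pdf(t)
        den += w * mixture_pdf(t)
    return num / den

z = -3.0
print(f"Fdr(z) = {Fdr(z):.5f}, averaged fdr = {averaged_fdr(z):.5f}, fdr(z) = {fdr(z):.5f}")
```

In this example the two tail quantities agree to numerical accuracy, and both are smaller than fdr(z) at the cutoff itself, because the average is taken over cases lying further into the tail.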

The probability model that leads to the fdr and
Fdr statistics in repeated applications assumes
exchangeability in two ways. First, $p_0$ should not
depend on $i$, as Efron discusses in Section 2. Second,
$f_0$ and $f_1$ must be the same for all problems
$i = 1, 2, \dots, N$. From the two-level modeling perspective
of the paper, $f_1(z)$ is a mixture of densities for
the (approximately) $N p_1$ values of $\mu_i$ that are
distributed according to $H_1$. Denoting the random
effects as $\mu_i$ for $i = 1, \dots, N$, exchangeability permits
the conditional densities $f(z_i \mid \mu_i)$ for $Z_i$ to depend on
$i$ through $\mu_i$ only, and not otherwise to depend on
$i$.

Some two-level settings are modeled with "paired"
exchangeability among individuals [i.e., the collection
of pairs $(z_i, \mu_i)$ are exchangeable], and that produces
exchangeability for the marginal distributions
of $Z_i$. This happens familiarly with $N$ independent
individuals (in the paper, "individuals" are the $N$
genes, and the schools, etc.) if the joint distributions
of $(z_i, \mu_i)$ are i.i.d. Robbins' original introduction
of empirical Bayes for Poisson models rested
on paired exchangeability because every individual
Poisson distribution was assumed in his paper to
have the same exposure. The James-Stein estimator
arises as a parametric empirical Bayes estimator,
but only when paired exchangeability holds, as
when the sample means all have the same variances.
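
As a reminder of how the James-Stein estimator emerges as a parametric empirical Bayes rule in the equal-variance case, here is the standard derivation in outline. Unit sampling variances and shrinkage toward zero are simplifying assumptions made for this sketch, and the symbols $A$ and $B$ are introduced here for illustration.

```latex
\begin{aligned}
& z_i \mid \mu_i \sim N(\mu_i, 1), \qquad \mu_i \sim N(0, A), \qquad i = 1, \dots, N, \\
& E(\mu_i \mid z_i) = (1 - B)\, z_i, \qquad B = \frac{1}{1 + A}, \\
& z_i \sim N(0,\, 1 + A) \ \text{marginally, so} \quad
  E\!\left[\frac{N - 2}{\sum_{j} z_j^2}\right] = B \quad (N \ge 3), \\
& \hat{\mu}_i^{\mathrm{JS}} = \left(1 - \frac{N - 2}{\sum_{j} z_j^2}\right) z_i .
\end{aligned}
```

The unknown shrinkage factor $B$ is estimated from the marginal distribution of the data alone. If the sampling variances were unequal, say $V_i$, the Bayes shrinkage factors $B_i = V_i / (V_i + A)$ would differ across $i$, and a single common estimate of $B$ would no longer suffice.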

A happy consequence of pairwise exchangeability
is that Bayesian procedures often can conveniently
be expressed explicitly in terms of the marginal
(unconditional) distribution of the data ($z_i$), and that
marginal can be estimated directly from the observed
data, as Efron has done in several settings. This
gives an asymptotically consistent estimate of a Bayes
procedure, and the statistician then can avoid directly
estimating the mixing distribution $g(\cdot)$ that
governs the random effects, $\mu_i$. Relatively simple
expressions then may emerge, such as the procedures
of Robbins, of Stein, and of Benjamini-Hochberg.
As Efron notes, the independence assumption is not
crucial, but exchangeability is. The Fdr and fdr statistics
in the exchangeable setting of Efron's Section 2
should work well with pairwise exchangeability
when $N$ is large, but exchangeability can be restrictive
and may depend heavily on prior knowledge.
Seemingly, exchangeability is widely considered to
hold for microarray, proteomics, BRCA and spectroscopy
data. It cannot be valid for the school data
because school enrollments, that is, sample sizes $n_i$,
vary. Nearly all theory presented in this paper is
based on such exchangeability, barring the discussion
of nonexchangeable choices for $p_0$ in Section
2. Is "empirical Bayes" in this paper meant to be
limited to exchangeable (or pairwise exchangeable)
settings?



3. MULTIPLE HYPOTHESIS TESTING: LOOKING FOR
LARGE RANDOM EFFECTS

Here is an extension of Efron's approach that may
be especially useful for identifying large random
effects $\mu_i$. First consider and fix any single value of
$i$, $1 \le i \le N$, with $z = z_i$ having been observed, and
assume that the "theoretical null" $N(0, 1)$ distribution
holds for $z_i$ under $H_0$, that is, when the random
effect $\mu = \mu_i = 0$. Assume $p_0$, $f_0$ and $f_1$ all are
known for this value $i$, as in Section 2, and that
$g(\cdot)$ is known. Then $f_1(z)$, the marginal distribution
of $z$ under $H_1$, is determined by integrating the
conditional distribution of $z$ given $\mu$, for example,
$z \sim N(\mu, 1)$ having density $\phi(z - \mu)$, over the
distribution $g(\mu)$ that governs the $H_1$ distribution of
$\mu$. (Exchangeability does not matter when all these
distributions are known.) Then when $H_1$ holds, the
density of $\mu$ given $z$ is

$$h(\mu \mid z) = \frac{\phi(z - \mu)\, g(\mu)}{f_1(z)}.$$

With fdr(z) = $P(\mu = 0 \mid z)$, and writing $\delta(\mu)$ as the
Dirac delta function ($\mu = 0$ with probability 1 when
$H_0$ is true), the density of $\mu$ given $z$ is expressible
as a mixture involving Efron's fdr(z) according to

$$p(\mu \mid z) = \mathrm{fdr}(z)\,\delta(\mu) + (1 - \mathrm{fdr}(z))\, h(\mu \mid z).$$

If all these distributions and values were known, one
could "test" $H_0 : \mu = 0$ (or $\mu \le 0$?) versus $H_1 : \mu > 0$
by using fdr(z) as the probability of $H_0$. However,
one well might prefer only to identify genes "far from
$H_0$," that is, only select values of $\mu > k$ that exceed a
scientifically substantial magnitude $k > 0$, and with
a substantial probability. One then would use $p(\mu \mid z)$
in the formula above to calculate $P(\mu > k \mid z)$.

Numerical illustrations are easy to do, and here
is one based on the assumptions in Section 5 of the
paper, with $N = 3000$, $p_0 = 0.9$, and Normal distributions
with $z_i \sim N(\mu_i, 1)$ and $g(\mu_i)$ being the
$N(2.5, 0.5)$ density. Then values of $z > 3.5$ occur in
2.1% of the genes, so $z > 3.5$ identifies about 63 of
the 3000 genes. If we were to choose $k = 2.8$, then
$P(\mu > 2.8 \mid z) = 0.506$ at the threshold value $z = 3.5$,
and the conditional probability that $\mu > 2.8$ rises as
$z$ increases. Researchers who wish to identify about
63 genes (2.1%) would calculate $P(\mu_i > 2.8 \mid z_i)$ for
every one of the 63 selected genes, all of which have
at least a 50% chance of $\mu > k = 2.8$, and (by averaging)
would find that overall about 60% of the 63 selected
cases have $\mu_i > 2.8$. The 60% statement is analogous to
Benjamini-Hochberg's calculation, calculated here
by averaging the 63 selected posterior probabilities.
If a smaller value $k = 2.0$ were chosen, then selected
genes at that threshold, still with $z > 3.5$, would
have at least a 90% chance (90% if $z = 3.5$ exactly)
that $\mu > 2.0$, and one would know that about 95%,
or 60 of the 63 selected cases, would have $\mu > 2.0$.
Of course, if $k = 0$, as in the paper, then fdr(z) and
Fdr(z) would indicate that about 98% (61 or 62) of
the 63 selected cases with $z > 3.5$ would have $\mu > 0$.
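
The threshold calculations in this illustration can be reproduced with a few lines of Python. The sketch below assumes a N(0, 1) null with null probability 0.9, unit-variance Normal noise, and a Normal mixing distribution with mean 2.5 whose second parameter 0.5 is read as a variance. That reading is an assumption of the sketch, and the function names are hypothetical.

```python
import math

def norm_pdf(x, mean=0.0, var=1.0):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def norm_cdf(x, mean=0.0, var=1.0):
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0 * var)))

P0 = 0.9                            # prior probability of the null
PRIOR_MEAN, PRIOR_VAR = 2.5, 0.5    # g(mu) = N(2.5, 0.5); 0.5 assumed to be a variance

def fdr(z):
    """Local fdr: P(mu = 0 | z), with f1 = N(PRIOR_MEAN, 1 + PRIOR_VAR)."""
    null = P0 * norm_pdf(z)
    alt = (1.0 - P0) * norm_pdf(z, PRIOR_MEAN, 1.0 + PRIOR_VAR)
    return null / (null + alt)

def prob_mu_exceeds(k, z):
    """P(mu > k | z) for k >= 0; the point mass at mu = 0 contributes nothing."""
    post_var = 1.0 / (1.0 + 1.0 / PRIOR_VAR)            # Normal-Normal conjugate update
    post_mean = post_var * (z + PRIOR_MEAN / PRIOR_VAR)
    return (1.0 - fdr(z)) * (1.0 - norm_cdf(k, post_mean, post_var))

def tail_fraction(z):
    """P(Z > z) under the two-groups mixture."""
    null_tail = P0 * (1.0 - norm_cdf(z))
    alt_tail = (1.0 - P0) * (1.0 - norm_cdf(z, PRIOR_MEAN, 1.0 + PRIOR_VAR))
    return null_tail + alt_tail

print(f"P(Z > 3.5)            = {tail_fraction(3.5):.4f}")
print(f"P(mu > 2.8 | z = 3.5) = {prob_mu_exceeds(2.8, 3.5):.3f}")
print(f"P(mu > 2.0 | z = 3.5) = {prob_mu_exceeds(2.0, 3.5):.3f}")
```

With these assumptions the tail fraction comes out near 2.1%, and the two threshold probabilities near 0.506 and 0.90, in line with the figures quoted above. That agreement is what suggests reading 0.5 as a variance rather than a standard deviation.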

The preceding assumes a one-tailed test, as does
Fdr, and so we have used $k > 0$ (if large positive
values of $\mu$ are wanted), but two-tailed probabilities
also are easy to evaluate. A table of the $N = 3000$
genes could list genes, sorted by their values
of $P(\mu > k \mid z)$, using $p(\mu \mid z)$. With exchangeability,
the ordering is that of $z_i$. Researchers could review
these values of $P(\mu > k \mid z)$, keeping as many genes
as desired, and stop when this probability becomes
too low, or when enough candidates have been ac- 
cepted. There is nothing special about keeping 2.1% 
and changing the cutoff for z would alter that per- 
centage. Experience gained with different values of 
k after a variety of analyses with various data sets 
eventually might help identify the scientifically most 
useful values. 

Of course $g(\mu)$ and the other constants are not
generally known. That is the point of Efron's paper,
but $g$ can be estimated by a variety of methods,
frequentist, Bayesian and empirical Bayesian, and
perhaps quite accurately with large $N$. The paper
shows some nifty ways to estimate $f_1$ in exchangeable
settings. Then one could use the estimated $f_1$ to
estimate $g(\mu)$, perhaps by deconvolution methods.
While estimating these mixing distributions $g(\cdot)$
becomes more difficult in nonexchangeable cases when
the $z_i$ have different conditional distributions given
$\mu_i$, the literature provides a variety of ways to do
that, most easily in parametric settings.

The proposal just described would test interval
null hypotheses instead of single points by calculating
$P(H_1)$ given the data, also by using the data to
learn about various constants and distributions, for
example, about $p_0$, $g(\cdot)$, etc. Doing this in conjunction
with choosing a $k > 0$ has been recommended by
Burgess, Christiansen, Michalak and Morris (2000)
for profiling hospital performances. Standard practice
for medical profiling most commonly is based on
testing different hypotheses like $H_0 : \mu_i = 0$
independently, using standard methods like those
widely taught in beginning statistics courses. That
forfeits the possibility of developing more information
via multilevel modeling. Once multilevel models have
been fitted, it is natural to consider alternative
hypotheses like $H_1 : \mu > k$ where $k > 0$ is chosen to
set standards ($k$) for unacceptable (or laudatory)
departures from average outcomes of medical
procedures. The analogous proposal is made
here, which can be extended to accommodate a spike
at 0 with $p_0 > 0$ within $H_0 = (-k, k)$ if required.
That extension is not needed with medical profiling
data, where it is unlikely that any sizeable fraction
$p_0$ of hospitals would have precisely the same
underlying rates of surgical outcomes, but the paper's
applications make it clear that positive probability
for a null point within $H_0$ is appropriate in a variety
of problems.

In exchangeable cases, ranking according to
p-values will not depend on the choice of $k$. With
medical outcome data for hospitals, the number of
treated patients always will vary substantially,
producing nonexchangeability. Then shrinkages toward
a common mean will be greater for small hospitals
than for large ones, and the resulting rankings will
depend not only on $z_i$, but also on $n_i$ and on $k$.
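
A small sketch can show how rankings come to depend on sample size and on $k$ once sample sizes differ. It assumes a simple Normal two-level model with no spike at zero: estimates $y_i \mid \mu_i \sim N(\mu_i, V/n_i)$ and $\mu_i \sim N(0, A)$. The two hospitals, the values of $V$ and $A$, and the function names are all made up for illustration.

```python
import math

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def shrinkage(n, V, A):
    """B_i = (V/n_i) / (V/n_i + A): shrinkage toward the common mean (0 here)."""
    return (V / n) / (V / n + A)

def prob_exceeds(z, n, k, V=1.0, A=0.01):
    """P(mu_i > k | data) when y_i | mu_i ~ N(mu_i, V/n_i) and mu_i ~ N(0, A)."""
    se = math.sqrt(V / n)
    y = z * se                              # raw estimate recovered from the z-score
    B = shrinkage(n, V, A)
    post_mean = (1.0 - B) * y               # shrunken estimate
    post_sd = math.sqrt((1.0 - B) * V / n)  # posterior standard deviation
    return 1.0 - norm_cdf((k - post_mean) / post_sd)

# Hypothetical hospitals: (z_i, n_i). The small one has the larger z-score.
hospitals = {"small": (3.0, 25), "large": (2.6, 400)}

def ranking(k):
    """Hospital names ordered by decreasing P(mu_i > k | data)."""
    return sorted(hospitals, key=lambda h: prob_exceeds(*hospitals[h], k), reverse=True)

for k in (0.0, 0.15):
    probs = {h: round(prob_exceeds(*hospitals[h], k), 3) for h in hospitals}
    print(f"k = {k}: ranking = {ranking(k)}, probabilities = {probs}")
```

In this made-up example the small hospital is shrunk much more heavily than the large one. At $k = 0$ the large hospital ranks first even though its z-score is smaller, while at $k = 0.15$ the order reverses, so neither $z_i$ alone nor a single choice of $k$ settles the ranking.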

4. NONEXCHANGEABILITY, THE SCHOOL 
DATA AND THE ONE-GROUP MODEL 

The school data of Figure 1(b) are not exchangeable
because the sample sizes $n_i$ (actually there are
two different sample sizes for each school, one for
each demographic group) surely vary across the
$N = 3748$ schools. Equal sample sizes might lead to
exchangeability, but that rarely happens except with
designed experiments, as the microarray experiments
must be. Together (e.g., in Efron and Morris, 1975),
Brad and I once used toxoplasmosis summaries for
$N = 36$ regions to illustrate generalizations of Stein's
estimator that were needed to account for different
sample sizes in different regions. Those toxoplasmosis
data, the hospital profiling data, and the
school data in this paper all might be similarly modeled.
The school data calculations suggest shrinkages
should vary, but average about 40%. A sharp
null with $p_0$ much in excess of 0 seems implausible
for toxoplasmosis, for hospital data, and for the
school data, and so Efron introduces the case $p_0 = 0$
as his "one group model." One would then expect
that $\mathrm{Var}(z_i)$ is proportional to $n_i$. That would cause
longer-tailed distributions for the $\{z_i\}$ values than
Normality allows, and schools with more students
would tend to be the outliers. Figure 1(b) reveals
evidence of such long tails, corresponding to
nonexchangeability.

5. INTERVAL ESTIMATION 

Efron's Section 7, about interval estimation, shows 
in a simulation with exchangeable data that the 
FCR intervals are too wide. That happens because 
the FCR does not adjust its slope to be less than 
1.0 when a gentler slope closer to 0.5 would track 
regression toward the mean (RTTM) of the 1000 
random effects better. Interval estimates recentered 
according to this slope improvement can be shorter 
and still cover at the same rate as FCR does. Morris 
(1983) provides a basis for evaluating interval cov- 
erages via repeated sampling. Figure 1 in that 1983 
paper (data for N = 18 baseball players from some 
early Efron-Morris papers) illustrates how intervals 
centered on shrunken estimates are much more ac- 
curate. The graph there makes the same point that 
Efron does in Figure 8. However, Brad's Section 7 
conclusion avers that Bayesian intervals cannot be 
trusted. That does not square with my experience 
because I have found Bayesian reasoning to be es- 
sential to understanding how to construct interval 
estimates that have good frequency properties. 

With 1000 observations it makes sense to estimate
the distribution $g(\mu)$ without assuming Normality,
and instead to use exchangeability as a basis
for estimating the marginal distribution of the
$\{z_i\}$. The same can be done with Bayesian methods,
even with a nonparametric specification for $g$,
although less easily. With unequal sample sizes, or
when $N$ is not large, a Bayesian approach may be
more successful, as with the heart transplant data of
Christiansen and Morris (1997). Ensuring that
Bayesianly constructed confidence intervals will
meet frequency resampling criteria requires identifying
and using frequency-friendly noninformative
distributions for the hyperparameters. This has been
done in a variety of specific parametric settings, in- 
cluding for some common generalized linear models. 
Bayesian reasoning also shows us how to account 
for added variability in settings where the hyper- 
parameters and shrinkage constants have been esti- 
mated. Such intervals must bow outward in Efron's 
Figure 8 when moving away from the center, and 
this is seen more dramatically when N = 18 in Fig- 
ure 1 of Morris (1983). Efron's Figure 8 shows no 
bowing, but that would be too small to see with
such large sample sizes. More discussion is needed
as to whether Bayesian reasoning really has failed in 
the Section 7 setting, and about what an empirical 
Bayes approach really can offer, beyond suggesting 
Bayesian methods designed to withstand frequency 
verifications. 

6. MODELING AND RTTM 

Two-level modeling can reveal by how much ran- 
dom effects will regress (shrink) toward the mean 
(RTTM). The modeling task is to estimate the mean 
to shrink toward, and determine how much shrink- 
age. A term I always liked that Brad used when we 
wrote together is "ensemble information." RTTM 
means individual estimates will regress toward the 
ensemble estimates. 

The paper focuses on rectangular $X$ as an $N$-by-$n$
data matrix. When $X$ is rectangular, it is especially
valuable to analyze the distribution of the rows and
columns of $X$, calculating correlations as Brad does
among the rows (genes) and/or among the columns
(arrays) to improve estimates of $f_0$, $f_1$ and $p_0$, and
thereby to keep modeling assumptions at a minimum.

Of course X need not be rectangular, nor should 
it automatically be so considered, because different 
rows sometimes may contain different amounts of 
data. The school data would follow such a nonrect- 
angular shape if each row were to include separate 
entries for each student (as the BRCA and the HIV 
data do, but always with the same sample sizes). In 
this case, the school data have been forced into a 
rectangular Procrustean bed by using summarized 
data $z_i$, and that has obscured their nonexchangeability.

Sometimes it pays to take advantage of situations 
when N is large, but without appealing to asymp- 
totics. In the context of the paper, that might be 
done by increasing the number of parameters and 
fitting richer models as N increases. This is para- 
metric model-building, but it is an alternative to 
nonparametric modeling. The paper does some of 
this to investigate correlations, but the same could 
be done to assess whether exchangeable models are 
adequate. 

A model for microarray data considered by Hongkai
Ji and Wing Wong (Ji and Wong, 2005) concerns
dealing with the (nuisance) standard deviations $\sigma_i$
that are estimated in the denominator of each
t-statistic, like those considered in Sections 4 and 5
in Efron's paper. The sample standard deviations
$S_i$ (each based on just a few degrees of freedom, as
with the BRCA and HIV data) easily can turn out
small by chance as estimates of $\sigma_i$, and hence
produce large t-values that falsely indicate which
genes are expressing themselves importantly. A way
out of this is to consider
the $N$ problems to be exchangeable with respect to
the random effects $\mu_i$ and also the $\sigma_i$. That justifies
shrinkage methods (based on chi-squared distributions).
Ji shows that shrinking the sample standard
deviations $S_i$ toward their common mean, and using
these empirical Bayes shrunken estimates in place
of $S_i$ in the t-statistics, greatly improves the rate of
false gene discoveries.

7. CONCLUSION 

Brad Efron's paper introduces many ideas for an- 
alyzing massive datasets. It encourages a frequency- 
Bayes unification and empirical Bayes modeling. The 
paper identifies modeling and inference opportunities
that arise with massive datasets in exchangeable
settings. Much remains to do to understand
the exchangeable case for parametric and nonparametric
models alike, and there is much to do to recognize
when nonexchangeable models are required,
and how to fit them.